Integration performance optimization: de-broadcast dipole integration #258

Lazersmoke · 2024-04-30T17:44:04Z

Broadcasting can be (very) slow compared to the for loops this PR introduces, especially in cases where the size of the array is something like 1x1x1x1 (requires 4 dimensions of broadcasting codegen, bounds checks, etc that doesn't need to be there). You can diagnose this is happening if you see lots of calls to broadcast.jl, materialize!: line ### in the @profview profiling window originating from lines with @. inside Integrators.jl, #step!.

For my benchmark (not compatible with Sunny#main) I get a modest speedup (attached screenshot)

Lazersmoke · 2024-04-30T18:25:27Z

Minimal code to reproduce:

using Sunny, LinearAlgebra, Chairmarks

cryst = Crystal(I(3), [[0.,0,0]],1)
sys = System(cryst,(1,1,1), [SpinInfo(1;S=1/2, g=2.5)], :dipole)
set_external_field!(sys,[0,0.3,0])
int = ImplicitMidpoint(0.001; damping=0.1, atol=1e-12)
@b (for j = 1:100_000; step!(sys,int); end)

kbarros · 2024-04-30T22:24:36Z

This is an interesting observation. To some extent I can reproduce the slowdown.

Here is a simpler stand-alone test we could report to Julia developers. Interestingly, it is not always the case that broadcasting is bad. In fact, sometimes it beats for loops.

using StaticArrays, Chairmarks
const Vec3 = SVector{3, Float64}

# An extreme case: Broadcasting over vectors of length 1.
Δs = randn(Vec3, 1)
s = randn(Vec3, 1)
∇E = randn(Vec3, 1)

# For loops win for this particular kernel
function f1(Δs, s, ∇E)
    for iters in 1:10_000
        for i in eachindex(Δs)
            Δs[i] = -s[i] × (s[i] × ∇E[i])
        end
    end
end
function f2(Δs, s, ∇E)
    for iters in 1:10_000
        @. Δs = -s × (s × ∇E)
    end
end
@b f1(Δs, s, ∇E) # 28.625 μs
@b f2(Δs, s, ∇E) # 34.041 μs


# Broadcasting (very slightly) wins for a simpler kernel
function g1(Δs, s, ∇E)
    for iters in 1:10_000
        for i in eachindex(Δs)
            Δs[i] = -s[i] × ∇E[i]
        end
    end
end
function g2(Δs, s, ∇E)
    for iters in 1:10_000
        @. Δs = -s × ∇E
    end
end
@b g1(Δs, s, ∇E) # 25.500 μs
@b g2(Δs, s, ∇E) # 23.958 μs

Lazersmoke · 2024-04-30T22:55:51Z

This appears to depend mainly on the number of [i] (i.e. how many bounds checks). For our case at least, it can be very efficient to use MVector for Δs and SVector for s; I get 4x speedup on g1 in Kip's benchmark. But it means we have to get the typing information in there somewhere. I'm not sure we want to store this in the type parameters of e.g. System, but if there's a way to convert sys.dipoles into an MVector before passing to rhs_dipole!, that would be highly highly advantageous.

kbarros · 2024-07-10T23:18:46Z

Might be worth also experimenting with this: https://github.com/YingboMa/FastBroadcast.jl.

de-broadcast dipole integration

1226a3b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Integration performance optimization: de-broadcast dipole integration #258

Integration performance optimization: de-broadcast dipole integration #258

Lazersmoke commented Apr 30, 2024

Lazersmoke commented Apr 30, 2024 •

edited by kbarros

Loading

kbarros commented Apr 30, 2024

Lazersmoke commented Apr 30, 2024

kbarros commented Jul 10, 2024

Integration performance optimization: de-broadcast dipole integration #258

Are you sure you want to change the base?

Integration performance optimization: de-broadcast dipole integration #258

Conversation

Lazersmoke commented Apr 30, 2024

Lazersmoke commented Apr 30, 2024 • edited by kbarros Loading

kbarros commented Apr 30, 2024

Lazersmoke commented Apr 30, 2024

kbarros commented Jul 10, 2024

Lazersmoke commented Apr 30, 2024 •

edited by kbarros

Loading